K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms

Authors

  • Pascal Vincent
  • Yoshua Bengio
Abstract

Guided by an initial idea of building a complex (non-linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than from prior knowledge. Experimental results on real-world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well as or better than SVMs.

1 Motivation

The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [1, 10] approach. The margin of SVMs has a nice geometric interpretation (for the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes): it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces.

Non-linear SVMs usually use the "kernel trick" to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The "trick" is that this mapping is never computed directly, but is implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite-dimensional, kernel-induced feature space rather than in the original input space. It is less clear whether maximizing the margin in this new space is meaningful in general. Indeed, [11] shows cases where, for any separating decision surface in input space, there is a feature space in which the corresponding decision surface is a maximum-margin hyperplane.

A different approach is to try to build a non-linear decision surface with maximal distance to the closest data point, as measured directly in input space. We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin within this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface that is as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now, would it be possible to find an algorithm that produces a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [4] does precisely this (a formal proof is beyond the scope of this article)! So why does 1NN in practice often perform worse than SVMs?
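Before turning to that question, here is a small numerical sketch of the claim just made (a toy illustration, not taken from the paper; the data points and the helper nn_class are made up): along the segment joining the closest pair of opposite-class training points, the 1NN prediction flips at the midpoint, so the local margin of the 1NN boundary there is half the distance between the two points, which is the largest local margin any correctly separating surface can achieve where it crosses that segment.

```python
# Toy illustration (not from the paper): the 1NN decision boundary lies
# equidistant from the two classes, so its local margin d(x, S) at the
# crossing point is as large as any correct separator could make it.
import numpy as np

# Made-up 2-D training set with two classes.
S_pos = np.array([[0.0, 0.0], [1.0, 0.5]])   # class +1
S_neg = np.array([[4.0, 0.0], [3.5, 1.0]])   # class -1

def nn_class(x):
    """1NN rule: return the class of the training point closest to x."""
    d_pos = np.linalg.norm(S_pos - x, axis=1).min()
    d_neg = np.linalg.norm(S_neg - x, axis=1).min()
    return +1 if d_pos <= d_neg else -1

# Walk along the segment joining the closest opposite-class pair and find
# where the predicted class flips: that point lies on the decision surface.
a, b = S_pos[1], S_neg[1]                    # closest opposite-class pair here
ts = np.linspace(0.0, 1.0, 10001)
points = a + ts[:, None] * (b - a)
labels = np.array([nn_class(p) for p in points])
flip = np.argmax(labels != labels[0])        # first index where the class changes
x_boundary = points[flip]

# Local margin at the boundary point = distance to the closest training point.
local_margin = min(np.linalg.norm(S_pos - x_boundary, axis=1).min(),
                   np.linalg.norm(S_neg - x_boundary, axis=1).min())
print("boundary point:", x_boundary)
print("local margin d(x, S):", local_margin)
print("half the distance between a and b:", np.linalg.norm(b - a) / 2)
# The last two numbers agree up to the sampling step: the 1NN boundary sits
# exactly halfway between the closest points of the two classes.
```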
One typical explanation is that 1NN has too much capacity compared to SVMs: the class of functions it can produce is too rich. Yet, despite this infinite capacity, 1NN still performs quite well in practice, so this explanation alone is not fully satisfying. This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding.

2 Fixing a broken Nearest Neighbor algorithm

2.1 Setting and definitions

The setting is that of a classical classification problem in ℝ^n (the input space). We are given a training set S of l points {x_1, ..., x_l}, x_i ∈ ℝ^n, and their corresponding class labels {y_1 = y(x_1), ..., y_l = y(x_l)}, y_i ∈ C, where C = {1, ..., N_c} and N_c is the number of different classes. The (x, y) pairs are assumed to be samples drawn from an unknown distribution P(X, Y). Barring duplicate inputs, the class labels associated with each x ∈ S define a partition of S: let S_c = {x ∈ S | y(x) = c}.

The problem is to find a decision function f̃ : ℝ^n → C that will generalize well on new points drawn from P(X, Y). Ideally, f̃ should minimize the expected classification error, i.e. minimize E_P[I_{f̃(X) ≠ Y}], where E_P denotes the expectation with respect to P(X, Y) and I_{f̃(x) ≠ y} denotes the indicator function, whose value is 1 if f̃(x) ≠ y and 0 otherwise.

In the previous and following discussion, we often refer to the concept of decision surface, also known as decision boundary. The function f̃ corresponding to a given algorithm defines, for any class c ∈ C, two regions of the input space: the region R_c = {x ∈ ℝ^n | f̃(x) = c} and its complement ℝ^n − R_c. The decision surface for class c is the interface between these two regions, and can be seen as an (n−1)-dimensional manifold (a "surface" in ℝ^n), possibly made of several disconnected components. For simplicity, when we mention the decision surface in our discussion, we consider only the case of two-class discrimination, in which there is a single decision surface.

When we mention a test point, we mean a point x ∈ ℝ^n that does not belong to the training set S and for which the algorithm is to decide on a class f̃(x).

By distance, we mean the usual Euclidean distance in the input space ℝ^n. The distance between two points a and b is written d(a, b) or alternatively ‖a − b‖. The distance between a single point x and a set of points S is the distance to the closest point of the set: d(x, S) = min_{p ∈ S} d(x, p).

The K-neighborhood V_K(x) of a test point x is the set of the K points of S whose distance to x is smallest. The K-c-neighborhood V_K^c(x) of a test point x is the set of the K points of S_c whose distance to x is smallest.

By Nearest Neighbor algorithm (1NN) we mean the following algorithm: the class of a test point x is decided to be the same as the class of its closest neighbor in S. By K-Nearest Neighbor algorithm (KNN) we mean the following algorithm: the class of a test point x is decided to be the class appearing most frequently among the K-neighborhood of x.
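The definitions above translate directly into code. The following is a minimal sketch (helper names and data are ours, not the authors' implementation) of d(x, S), the K-neighborhood V_K(x), the K-c-neighborhood V_K^c(x), and the resulting 1NN/KNN decision rules.

```python
# Minimal sketch of the objects defined above; names and data are illustrative.
import numpy as np
from collections import Counter

def d(x, S):
    """Euclidean distance from a point x to the closest point of the set S."""
    return np.linalg.norm(S - x, axis=1).min()

def k_neighborhood(x, S, K):
    """V_K(x): the K points of S whose distance to x is smallest."""
    idx = np.argsort(np.linalg.norm(S - x, axis=1))[:K]
    return S[idx], idx

def k_c_neighborhood(x, S, y, c, K):
    """V_K^c(x): the K points of S_c = {x in S | y(x) = c} closest to x."""
    return k_neighborhood(x, S[y == c], K)[0]

def knn_classify(x, S, y, K=1):
    """KNN rule: most frequent class in V_K(x); K = 1 gives the 1NN rule."""
    _, idx = k_neighborhood(x, S, K)
    return Counter(y[idx]).most_common(1)[0][0]

# Tiny usage example with made-up data.
S = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y = np.array([0, 0, 1, 1])
x = np.array([1.2, 0.9])
print(knn_classify(x, S, y, K=1))            # -> 0 (closest point is class 0)
print(knn_classify(x, S, y, K=3))            # -> 0 (two of three neighbors are class 0)
print(d(x, S))                               # distance to the closest training point
print(k_c_neighborhood(x, S, y, c=1, K=1))   # closest class-1 point: [[4. 4.]]
```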

Similar Resources

K-Local Hyperplane Distance Nearest-Neighbor Algorithm and Protein Fold Recognition

Two proteins may be structurally similar but not have significant sequence similarity. Protein fold recognition is an approach usually applied in this case. It does not rely on sequence similarity and can be achieved with relevant features extracted from protein sequences. In this paper, we experiment with the K-local hyperplane distance nearest-neighbor algorithm [8] applied to the protein fo...


Classification by ALH-Fast Algorithm

The adaptive local hyperplane (ALH) algorithm is a very recently proposed classifier, which has been shown to perform better than many other benchmarking classifiers including support vector machine (SVM), K-nearest neighbor (KNN), linear discriminant analysis (LDA), and K-local hyperplane distance nearest neighbor (HKNN) algorithms. Although the ALH algorithm is well formulated and despite the...


Protein Fold Recognition with K-Local Hyperplane Distance Nearest Neighbor Algorithm

This paper deals with protein structure analysis, which is useful for understanding function of proteins and therefore evolutionary relationships, since for proteins, function follows from form (shape). One of the basic approaches to structure analysis is protein fold recognition (protein fold is a 3-D pattern), which is applied when there is no significant sequence similarity between structura...


Colorectal Cancer and Colitis Diagnosis Using Fourier Transform Infrared Spectroscopy and an Improved K-Nearest-Neighbour Classifier

Combining Fourier transform infrared spectroscopy (FTIR) with endoscopy, it is expected that noninvasive, rapid detection of colorectal cancer can be performed in vivo in the future. In this study, Fourier transform infrared spectra were collected from 88 endoscopic biopsy colorectal tissue samples (41 colitis and 47 cancers). A new method, viz., entropy weight local-hyperplane k-nearest-neighb...


Distance Metric Learning: A Comprehensive Survey

Many machine learning algorithms, such as K Nearest Neighbor (KNN), heavily rely on the distance metric for the input data patterns. Distance metric learning aims to learn a distance metric for the input space of data from a given collection of pairs of similar/dissimilar points that preserves the distance relation among the training data. In recent years, many studies have demonstrated, both empi...




Publication date: 2001